Return-Path: <news@columbia.edu>
Received: from newsmaster.cc.columbia.edu (newsmaster.cc.columbia.edu [128.59.59.30])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA27636
	for <kermit.misc@watsun.cc.columbia.edu>; Sun, 23 Jan 2000 19:57:18 -0500 (EST)
Received: (from news@localhost)
	by newsmaster.cc.columbia.edu (8.8.5/8.8.5) id TAA19711
	for kermit.misc@watsun.cc.columbia.edu; Sun, 23 Jan 2000 19:28:34 -0500 (EST)
X-Authentication-Warning: newsmaster.cc.columbia.edu: news set sender to <news> using -f
From: fdc@watsun.cc.columbia.edu (Frank da Cruz)
Subject: Case Study #14: Character Sets
Date: 24 Jan 2000 00:28:33 GMT
Organization: Columbia University
Message-ID: <86g6bh$j7s$1@newsmaster.cc.columbia.edu>
To: kermit.misc@columbia.edu
Some recent questions about character-sets prompt today's discussion. As
you probably know, Kermit software is practically (and perhaps actually)
unique among communication software packages in its ability to convert the
character sets of text files while transferring them between platforms that
use different ones. In the recent postings, the need was to transfer
Portuguese text between a PC that used PC Code Page 850 (CP850) and a UNIX
system that used some other encoding.
Kermit protocol and software have been able to handle such tasks since the
1980s. This feature is important to everybody who reads and writes a
language that uses accented and/or non-Roman characters -- in other words,
the overwhelming majority of humanity. Only a few languages are written
entirely in plain ABCs: English, Latin, Malay, and maybe Dutch. Nearly all
the others need accents or non-ABC characters. But accented and non-Roman
characters are represented differently on different computers. So
(returning to our example) if you copy Portuguese text from (say) DOS or
Windows to (say) HP-UX or VMS, all the accented letters become, well,
garbage. If you copy Greek, Russian, or Hebrew text between the same two
computers, ALL the letters become garbage.
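To see the problem concretely, here is a small sketch in Python (not Kermit; the sample word and the use of Python's cp850, latin-1, and hp_roman8 codecs are illustrative assumptions). It stores a Portuguese word as CP850 bytes and then reads the same bytes back, unconverted, as if they were Latin-1 or HP-Roman8:

```python
# Illustrative sample word; any text with accented letters would do.
text = "ação"

# The bytes as a DOS PC using Code Page 850 would store them.
cp850_bytes = text.encode("cp850")

# The same bytes read on a system that assumes a different encoding,
# with no conversion applied in between.
as_latin1 = cp850_bytes.decode("latin-1")
as_roman8 = cp850_bytes.decode("hp_roman8")

print("bytes on disk :", cp850_bytes.hex(" "))
print("read as CP850 :", repr(cp850_bytes.decode("cp850")))
print("read as Latin1:", repr(as_latin1))
print("read as Roman8:", repr(as_roman8))
```

The plain letters survive because all three encodings agree on ASCII; only the accented letters come out wrong.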
What good is accomplished by moving text from one computer to another if the
result is gibberish? In the world at large, text-file transfer should
provide for character-set conversion. The Kermit protocol does; the method
was worked out in the late 1980s and is written up in papers you can find
at:
  http://www.columbia.edu/kermit/papers.html
Suppose you want to send Portuguese text from DOS to HP-UX (in DOS,
Portuguese text can be encoded in CP437, CP850, or CP860, each of them
different; your first job is to find out which one is actually used on your
PC). Let's say the encoding is CP850. You would tell Kermit on the PC to:
  set file character-set cp850
Use C-Kermit's menu-on-demand feature to find out what file character-sets
are available:
  set file character-set ?
This gives you the complete list.
But PC Kermit doesn't send CP850 on the wire, because it's a private
(proprietary) character set. Only standard character sets should be used
between computers. Kermit supports a small number of standard transfer
character-sets, each one covering its own group of languages (and therefore
file character-sets). You have to tell it which one to use; in this case,
ISO 8859-1 Latin Alphabet 1:
  set transfer character-set latin1
You can see the list of available transfer character-sets with:
  set transfer character-set ?
(If you obtained the two lists, you should have seen about 50 file
character-sets and 10 transfer character-sets, enough to cover the West and
East European Roman-alphabet languages, plus languages written in Cyrillic
and Hebrew, plus Greek and Japanese.)
Now when PC Kermit sends a file in text mode, it converts the file from
CP850 to Latin-1, and announces the Latin-1 encoding to the receiving
Kermit program.
Meanwhile, because the HP-Roman8 character set is used on HP-UX, which is
different not only from the PC code pages just mentioned but also from
Latin-1, HP-UX C-Kermit must be told to:
  set file character-set hp-roman8
The final step is to make sure the file sender transfers the file in text
mode, rather than binary mode, because character-set and record-format
conversions take place only in text mode:
  set file type text
Now the file can be transferred. To summarize, the following commands are
given to the file sender:
  set file character-set cp850       ; Identify the source file encoding
  set transfer character-set latin1  ; Specify the transfer encoding
  set file type text                 ; Choose text mode
  send quilombo.txt                  ; Send a file
and to the file receiver:
  set file character-set hp-roman8   ; Identify target file encoding
  receive                            ; Receive the file
The file sender tells the file receiver to expect a text file encoded in
Latin-1; the file sender converts from CP850 to Latin-1, and the file
receiver converts from Latin-1 to HP-Roman8. To send files in the other
direction, simply exchange the SEND and RECEIVE commands (keeping the
SET FILE TYPE TEXT command with the file sender); the rest stays the same.
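The two-stage conversion can be sketched in Python as follows. This is only an analogy under stated assumptions: Kermit uses its own translation tables, whereas this sketch goes through Python's built-in codecs, and the function names send_side and receive_side and the sample text are hypothetical.

```python
def send_side(file_bytes, file_charset="cp850", transfer_charset="latin-1"):
    """Sender: convert from the local file encoding to the transfer encoding."""
    return file_bytes.decode(file_charset).encode(transfer_charset)


def receive_side(wire_bytes, transfer_charset="latin-1", file_charset="hp_roman8"):
    """Receiver: convert from the transfer encoding to the local file encoding."""
    return wire_bytes.decode(transfer_charset).encode(file_charset)


# Hypothetical file contents, stored on the PC in CP850.
original = "nação, coração, pão"
on_pc = original.encode("cp850")

on_wire = send_side(on_pc)         # Latin-1 between the two computers
on_hpux = receive_side(on_wire)    # HP-Roman8 on the receiving system

# Three different byte strings, one and the same text.
print(on_pc.hex(" "))
print(on_wire.hex(" "))
print(on_hpux.hex(" "))
print(on_hpux.decode("hp_roman8"))
```

Neither side needs to know the other's file character-set; each one only has to know its own encoding and the standard one used on the wire.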
This is all old news, but it might still be new to many readers. The
procedures and specific character sets are documented in Chapter 16 of
"Using C-Kermit", 2nd Edition, and in other Kermit manuals. All of the
facilities discussed until now are found in C-Kermit 5A and later, MS-DOS
Kermit 3.0 and later, Kermit 95 (all versions), and IBM Mainframe Kermit
since (I think) version 4.1.
So what's new in C-Kermit 7.0 and the forthcoming 1.1.18 release of Kermit
95? Lots of new character sets have been added, including many for Eastern
Europe and the former Soviet Union, as well as those used for Greek. And
Unicode, the new Universal Character Set, which was discussed in a previous
posting. So now the possibilities for character-set conversion are wider
than ever.
And in keeping with our goal that C-Kermit 7.0 "just work" for most people
most of the time, we have also added not just automatic text/binary mode
switching, discussed previously, but also automatic character-set
associations, in which each file character-set is associated with an
appropriate transfer character-set, and vice versa. C-Kermit comes with a
comprehensive table of associations preloaded, which you can view with:
  show associations
Perhaps you were wondering (if you don't have a manual) how you were
supposed to know that Latin-1 was the appropriate transfer character-set for
CP850? Good question! Now this information is built in to C-Kermit. So
whenever you pick a file character-set, C-Kermit picks the appropriate
transfer character-set for you, and vice versa. Furthermore, whenever
C-Kermit receives a text file in a particular transfer character-set, it
converts it to the appropriate file character-set automatically, even if you
have not told it which one to use. So the sequence above is now simplified.
At the sender:
  set file character-set cp850       ; Identify the source file encoding
  send *.*                           ; Send some files
and at the file receiver:
  receive                            ; Receive the files
Appropriate associations are built in for each platform. So you just have
to start the ball rolling by specifying the encoding of the source file; the
rest flows from there. And now because of automatic text/binary mode
switching, you can send a mixed group of text and binary files and have the
character-set conversions applied only to the text files.
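The idea of associations can be pictured as a pair of lookup tables. The sketch below is in Python and is purely hypothetical: the table names, the function names, and the layout are assumptions, and only the two pairings mentioned in this posting are filled in (the real tables are what SHOW ASSOCIATIONS displays).

```python
# Hypothetical tables; only the pairings named in the text are included.
# On the sending side: file character-set -> transfer character-set.
FILE_TO_TRANSFER = {"cp850": "latin1"}
# On an HP-UX receiver: announced transfer character-set -> file character-set.
TRANSFER_TO_FILE = {"latin1": "hp-roman8"}


def transfer_set_for(file_charset):
    """Sender: pick the transfer character-set associated with the file's."""
    return FILE_TO_TRANSFER[file_charset]


def file_set_for(announced_transfer_charset):
    """Receiver: pick the local file character-set for an incoming text file."""
    return TRANSFER_TO_FILE[announced_transfer_charset]


announced = transfer_set_for("cp850")   # what the sender announces
local = file_set_for(announced)         # what the receiver stores the file in
print(announced, "->", local)
```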
Of course you can change associations if you need to. The command is
ASSOCIATE. You can also turn this whole feature on and off with SET SEND
(and RECEIVE) CHARACTER-SET-SELECTION. For complete details about
character-set associations, see Section 6.5 of the ckermit2.txt file.
So now C-Kermit is just about as automatic as it can be in this area. The
one thing it can't do is figure out automatically the encoding of a file.
Some people believe this can be done, but I'm not one of them. Operating
systems have never tagged files by encoding, and guessing the encoding from
inspection is highly unreliable.
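A short Python sketch (an illustration, not anything Kermit does; the sample text and the list of candidate encodings are assumptions) shows why inspection is a weak guide. The same bytes decode without any error under several single-byte encodings, so nothing in the file itself flags a wrong guess:

```python
# Hypothetical file contents, actually written in CP850.
data = "São Paulo".encode("cp850")

# Candidate encodings a guesser might try.
candidates = ["cp437", "cp850", "cp860", "latin-1"]

# Every candidate accepts every byte, so each one yields a "valid" reading.
readings = {name: data.decode(name) for name in candidates}

for name, reading in readings.items():
    print(f"{name:8} {reading!r}")
```

Only someone who knows what the text is supposed to say can tell which reading is right.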
By the way, C-Kermit's character-set conversion capabilities are not limited
to file transfer. They are also available in terminal (CONNECT) mode. In
this case you choose the translation with:
  set terminal character-set <remote-set> [ <local-set> ]
The <local-set> defaults to C-Kermit's current file character-set. Again,
type a question mark in the character-set field to get a list of available
choices.
Finally, you can also use C-Kermit to convert a local file from one
character-set to another. For example, to convert the file oofa.txt from
Latin-1 to the UTF-8 form of Unicode, and store the result as oofa.utf8,
the command would be:
  translate oofa.txt latin1 utf8 oofa.utf8
This is nothing new, except for the expanded character-set choices.
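For comparison, the same kind of local conversion can be sketched in Python. This is not how C-Kermit does it; the function name translate merely mirrors the command, its argument order is an assumption, and the sample file contents are made up. The demonstration works in a temporary directory so that it touches no real files.

```python
import tempfile
from pathlib import Path


def translate(src, src_charset, dst_charset, dst):
    """Re-encode the file src from src_charset to dst_charset, writing dst."""
    text = Path(src).read_bytes().decode(src_charset)
    Path(dst).write_bytes(text.encode(dst_charset))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "oofa.txt"
    dst = Path(tmp) / "oofa.utf8"
    # Hypothetical Latin-1 source file.
    src.write_bytes("Não é difícil.\n".encode("latin-1"))
    translate(src, "latin-1", "utf-8", dst)
    result = dst.read_bytes()

print(result.decode("utf-8"), end="")
```

The UTF-8 result is longer than the Latin-1 original because each accented letter takes two bytes instead of one.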
- Frank